Introduction

Categorical Feature Encoding Challenge: Kaggle Link

In this project the data file contains many categorical features. We are asked to predict a binary target from the other features using various feature-encoding techniques.

Data Description

# id features
'id',

# binary features
'bin_0', 'bin_1', 'bin_2', 'bin_3', 'bin_4',

# nominal features
'nom_0', 'nom_1','nom_2', 'nom_3', 'nom_4',
'nom_5', 'nom_6', 'nom_7', 'nom_8', 'nom_9',

# ordinal features       
'ord_0', 'ord_1', 'ord_2', 'ord_3', 'ord_4', 'ord_5',

# cyclical features
 'day', 'month',

# binary target
'target'

Types of Data

  • Continuous Data

    • Numbers such as 0.0, 0.1, ..., 100.0 are continuous data.
    • We can use histogram binning to encode continuous data.
    • We can group by features and use aggregate statistics such as mean and max.
  • Categorical Data

    • Binary data : has exactly two values, e.g. 0/1, -1/1, True/False
    • Discrete data : takes a limited set of values, e.g. 1-7 for weekdays.
    • Ordinal data : has an inherent order, e.g. movie reviews Bad, Medium, Good.
    • Nominal data : category values without order, e.g. USA, Nepal, Canada
  • Timeseries data : can be continuous or discrete, with a time attached to each observation. When doing a train-test split we should not shuffle them.
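The encodings mentioned above can be sketched with plain pandas. This is a toy illustration with made-up column names (`ord_temp`, `nom_country`), not the competition data:

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

df_toy = pd.DataFrame({
    'ord_temp': ['Cold', 'Hot', 'Warm', 'Cold'],      # ordinal
    'nom_country': ['USA', 'Nepal', 'Canada', 'USA'], # nominal
    'day': [1, 7, 3, 1],                              # cyclical (1-7)
})

# ordinal: map categories to integers that respect the order
cat_type = CategoricalDtype(categories=['Cold', 'Warm', 'Hot'], ordered=True)
df_toy['ord_enc'] = df_toy['ord_temp'].astype(cat_type).cat.codes

# nominal: one-hot encode (no order implied)
df_toy = pd.concat(
    [df_toy, pd.get_dummies(df_toy['nom_country'], prefix='nom')], axis=1)

# cyclical: sin/cos encoding so that day 7 sits next to day 1
df_toy['day_sin'] = np.sin(2 * np.pi * df_toy['day'] / 7)
df_toy['day_cos'] = np.cos(2 * np.pi * df_toy['day'] / 7)

print(df_toy[['ord_enc', 'day_sin', 'day_cos']])
```

With the ordered categories `['Cold', 'Warm', 'Hot']`, the codes come out 0/1/2, while `get_dummies` adds one indicator column per country.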

Resources

Imports

In [4]:
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(color_codes=True)

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
sns.set(context='notebook', style='whitegrid', rc={'figure.figsize': (12,8)})
plt.style.use('ggplot') # better than sns styles.
matplotlib.rcParams['figure.figsize'] = 12,8

import os
import time

# random state
SEED=100
np.random.seed(SEED)

# Jupyter notebook settings for pandas
#pd.set_option('display.float_format', '{:,.2g}'.format) # numbers sep by comma
pd.options.display.float_format = '{:,}'.format # df.A.value_counts().astype(float)
from pandas.api.types import CategoricalDtype
np.set_printoptions(precision=3)

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100) # None for all the rows
pd.set_option('display.max_colwidth', 200)

import IPython
from IPython.display import display, HTML, Image, Markdown

print([(x.__name__,x.__version__) for x in [np, pd,sns,matplotlib]])
[('numpy', '1.17.4'), ('pandas', '0.25.3'), ('seaborn', '0.9.0'), ('matplotlib', '3.1.2')]
In [0]:
%%capture

ENV_BHISHAN = None

try:
    import bhishan
    ENV_BHISHAN = True
    print("Environment: Bhishan's Laptop")
except ImportError:
    pass


import sys
ENV_COLAB = 'google.colab' in sys.modules

if ENV_COLAB:
    # load google drive
    # from google.colab import drive
    # drive.mount('/content/drive')
    # dat_dir = 'drive/My Drive/Colab Notebooks/data/' 
    # sys.path.append(dat_dir)
    
    # pip install
    #!pip install pyldavis
    # !pip install hyperopt
    !pip install catboost
    !pip install shap
    #!pip install eli5
    #!pip install lime
    # !pip install category_encoders # TargetEncoder
    # !pip install optuna # hyper param opt
    
    # print
    print('Environment: Google Colaboratory.')

if ENV_COLAB:
    # update modules
    !pip install -U scikit-learn
    !pip install -U tqdm # tqdm needs restart run time.
    pass
In [2]:
# encoders
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction import FeatureHasher

# folding
from sklearn.model_selection import KFold

# pipeline
from sklearn.base import BaseEstimator, TransformerMixin


import scipy

# evaluation
from sklearn.metrics import roc_auc_score as auc

# boosting
import sklearn
import xgboost, lightgbm, catboost
from catboost import Pool, CatBoostClassifier

# extra modules
# import category_encoders
# from category_encoders import TargetEncoder

print([(x.__name__,x.__version__) for x in [xgboost,lightgbm, catboost,
                                            sklearn,scipy]])
[('xgboost', '0.90'), ('lightgbm', '2.2.3'), ('catboost', '0.20'), ('sklearn', '0.22'), ('scipy', '1.3.3')]

Load the Data

In [5]:
ifile = 'https://github.com/bhishanpdl/Project_Categorical_Feature_Encoding/blob/master/data/raw/train.csv?raw=true'
df = pd.read_csv(ifile)
df = df.astype(str) # make all string for catboost
print(df.shape)
df.head()
(300000, 25)
Out[5]:
id bin_0 bin_1 bin_2 bin_3 bin_4 nom_0 nom_1 nom_2 nom_3 nom_4 nom_5 nom_6 nom_7 nom_8 nom_9 ord_0 ord_1 ord_2 ord_3 ord_4 ord_5 day month target
0 0 0 0 0 T Y Green Triangle Snake Finland Bassoon 50f116bcf 3ac1b8814 68f6ad3e9 c389000ab 2f4cb3d51 2 Grandmaster Cold h D kr 2 2 0
1 1 0 1 0 T Y Green Trapezoid Hamster Russia Piano b3b4d25d0 fbcb50fc1 3b6dd5612 4cd920251 f83c56c21 1 Grandmaster Hot a A bF 7 8 0
2 2 0 0 0 F Y Blue Trapezoid Lion Russia Theremin 3263bdce5 0922e3cb8 a6a36f527 de9c9f684 ae6800dd0 1 Expert Lava Hot h R Jc 7 2 0
3 3 0 1 0 F Y Red Trapezoid Snake Canada Oboe f12246592 50d7ad46a ec69236eb 4ade6ab69 8270f0d71 1 Grandmaster Boiling Hot i D kW 2 1 1
4 4 0 0 0 F N Red Trapezoid Lion Canada Oboe 5b0f5acd5 1fe17a1fd 04ddac2be cb43ab175 b164b72a7 1 Grandmaster Freezing a R qP 7 8 0
In [6]:
df.dtypes
Out[6]:
id        object
bin_0     object
bin_1     object
bin_2     object
bin_3     object
bin_4     object
nom_0     object
nom_1     object
nom_2     object
nom_3     object
nom_4     object
nom_5     object
nom_6     object
nom_7     object
nom_8     object
nom_9     object
ord_0     object
ord_1     object
ord_2     object
ord_3     object
ord_4     object
ord_5     object
day       object
month     object
target    object
dtype: object
In [7]:
[ (c,df[c].nunique()) for c in df.iloc[:,1:-1] ]
Out[7]:
[('bin_0', 2),
 ('bin_1', 2),
 ('bin_2', 2),
 ('bin_3', 2),
 ('bin_4', 2),
 ('nom_0', 3),
 ('nom_1', 6),
 ('nom_2', 6),
 ('nom_3', 6),
 ('nom_4', 4),
 ('nom_5', 222),
 ('nom_6', 522),
 ('nom_7', 1220),
 ('nom_8', 2215),
 ('nom_9', 11981),
 ('ord_0', 3),
 ('ord_1', 5),
 ('ord_2', 6),
 ('ord_3', 15),
 ('ord_4', 26),
 ('ord_5', 192),
 ('day', 7),
 ('month', 12)]
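The nominal features nom_5 through nom_9 have hundreds to roughly 12,000 distinct levels, so one-hot encoding them would blow up the feature space. One option already imported above is FeatureHasher, which hashes arbitrary strings into a fixed number of columns. A minimal sketch with made-up category strings:

```python
from sklearn.feature_extraction import FeatureHasher

# hash each high-cardinality value into a fixed-width sparse vector;
# identical strings always hash to the same vector
hasher = FeatureHasher(n_features=16, input_type='string')
values = ['50f116bcf', 'b3b4d25d0', '3263bdce5', '50f116bcf']
X_hashed = hasher.transform([[v] for v in values])  # iterable of token lists

print(X_hashed.shape)  # (4, 16)
```

The trade-off is that unrelated categories can collide in the same hashed column; a larger `n_features` reduces collisions at the cost of a wider matrix.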
In [0]:
# df_train.info()
In [0]:
# df_test.info()
In [0]:
# As given in the data description, the dataset does not have any nulls.

Target Distribution

In [11]:
target = 'target'
sns.countplot(df[target])
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f78ac20b438>

Useful Codes

In [0]:
def show_method_attributes(obj, ncols=7,start=None, inside=None):
    """ Show all the attributes of a given method.
    Example:
    ========
    show_method_attributes(list)
     """

    print(f'Object Type: {type(obj)}\n')
    lst = [elem for elem in dir(obj) if elem[0]!='_' ]
    lst = [elem for elem in lst 
           if elem not in 'os np pd sys time psycopg2'.split() ]

    if isinstance(start,str):
        lst = [elem for elem in lst if elem.startswith(start)]
        
    if isinstance(start,tuple) or isinstance(start,list):
        lst = [elem for elem in lst for start_elem in start
               if elem.startswith(start_elem)]
        
    if isinstance(inside,str):
        lst = [elem for elem in lst if inside in elem]
        
    if isinstance(inside,tuple) or isinstance(inside,list):
        lst = [elem for elem in lst for inside_elem in inside
               if inside_elem in elem]

    return pd.DataFrame(np.array_split(lst,ncols)).T.fillna('')

df_eval = pd.DataFrame({'Model': [],
                        'Description':[],
                        'Accuracy':[],
                        'Precision':[],
                        'Recall':[],
                        'F1':[],
                        'AUC':[],
                    })

Modelling

Train Test Split with Stratify

In [13]:
from sklearn.model_selection import train_test_split

Xtrain_orig, Xtest, ytrain_orig, ytest = train_test_split(
    df.drop(target,axis=1), 
    df[target],
    test_size=0.2, 
    random_state=SEED, 
    stratify=df[target])

df_Xtrain_orig = pd.DataFrame(Xtrain_orig, columns=df.columns.drop(target))
df_Xtest = pd.DataFrame(Xtest, columns=df.columns.drop(target))

print(df_Xtrain_orig.shape)
df_Xtrain_orig.head()
(240000, 24)
Out[13]:
id bin_0 bin_1 bin_2 bin_3 bin_4 nom_0 nom_1 nom_2 nom_3 nom_4 nom_5 nom_6 nom_7 nom_8 nom_9 ord_0 ord_1 ord_2 ord_3 ord_4 ord_5 day month
194882 194882 0 1 1 F Y Green Square Snake Russia Oboe 39647c92a 3ac1b8814 b042166d5 beacd1432 ff794dacf 2 Grandmaster Boiling Hot k J nX 7 9
84846 84846 0 0 1 F Y Green Triangle Dog India Theremin 35f65a9bf b1c48d202 3506befc5 4d7256d7f a49e41d63 1 Grandmaster Lava Hot g T UO 1 12
273779 273779 0 0 0 F N Red Polygon Cat China Theremin 3685a0904 d95501ac1 fb05522b9 ed4b0b59b 7d6ad1176 1 Novice Warm c Y ps 1 10
215432 215432 0 0 0 F Y Blue Star Lion Russia Piano 9ad6558d1 145d17afb 4323c4ab0 feb99cbd9 483ca632f 1 Novice Warm a O Fo 1 12
173196 173196 0 0 0 F Y Red Trapezoid Lion Finland Theremin f7821e391 5fa8beadb 4dcf33683 00405ddc2 0d3b2fe9b 1 Grandmaster Cold a Y Ml 3 3

Train Validation Split with Stratify

In [14]:
Xtrain, Xvalid, ytrain, yvalid = train_test_split(
    Xtrain_orig, 
    ytrain_orig,
    test_size=0.2, 
    random_state=SEED, 
    stratify=ytrain_orig)

df_Xtrain = pd.DataFrame(Xtrain, columns=df.columns.drop(target))
df_Xvalid = pd.DataFrame(Xvalid, columns=df.columns.drop(target))

print(df_Xtrain.shape)
(192000, 24)
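The commented-out category_encoders import above hints at target encoding. A hand-rolled, leak-aware variant using KFold might look like the sketch below; `df_demo`, `nom`, and `y` are illustrative names on synthetic data, not the competition columns:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

rng = np.random.RandomState(100)
df_demo = pd.DataFrame({
    'nom': rng.choice(list('abc'), size=20),
    'y':   rng.randint(0, 2, size=20),
})

# out-of-fold target (mean) encoding: each row is encoded with
# statistics computed on the *other* folds only, to avoid leakage
df_demo['nom_te'] = np.nan
kf = KFold(n_splits=5, shuffle=True, random_state=100)
for tr_idx, val_idx in kf.split(df_demo):
    means = df_demo.iloc[tr_idx].groupby('nom')['y'].mean()
    df_demo.iloc[val_idx, df_demo.columns.get_loc('nom_te')] = (
        df_demo['nom'].iloc[val_idx].map(means).values)

# fall back to the global mean for categories unseen in a fold
df_demo['nom_te'] = df_demo['nom_te'].fillna(df_demo['y'].mean())
```

Encoding each fold with statistics from the remaining folds is the same idea TargetEncoder implements, just written out explicitly.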

Modelling catboost

https://catboost.ai/docs/concepts/python-reference_catboostregressor.html

CatBoostRegressor(
iterations=None,
learning_rate=None,
depth=None,
l2_leaf_reg=None,
model_size_reg=None,
rsm=None,
loss_function='RMSE',
border_count=None,
feature_border_type=None,
per_float_feature_quantization=None,
input_borders=None,
output_borders=None,
fold_permutation_block=None,
od_pval=None,
od_wait=None,
od_type=None,
nan_mode=None,
counter_calc_method=None,
leaf_estimation_iterations=None,
leaf_estimation_method=None,
thread_count=None,
random_seed=None,
use_best_model=None,
best_model_min_trees=None,
verbose=None,
silent=None,
logging_level=None,
metric_period=None,
ctr_leaf_count_limit=None,
store_all_simple_ctr=None,
max_ctr_complexity=None,
has_time=None,
allow_const_label=None,
one_hot_max_size=None,
random_strength=None,
name=None,
ignored_features=None,
train_dir=None,
custom_metric=None,
eval_metric=None,
bagging_temperature=None,
save_snapshot=None,
snapshot_file=None,
snapshot_interval=None,
fold_len_multiplier=None,
used_ram_limit=None,
gpu_ram_part=None,
pinned_memory_size=None,
allow_writing_files=None,
final_ctr_computation_mode=None,
approx_on_full_history=None,
boosting_type=None,
simple_ctr=None,
combinations_ctr=None,
per_feature_ctr=None,
ctr_target_border_count=None,
task_type=None,
device_config=None,
devices=None,
bootstrap_type=None,
subsample=None,
sampling_unit=None,
dev_score_calc_obj_block_size=None,
max_depth=None,
n_estimators=None,
num_boost_round=None,
num_trees=None,
colsample_bylevel=None,
random_state=None, # SEED = 100
reg_lambda=None,
objective=None,
eta=None,
max_bin=None,
gpu_cat_features_storage=None,
data_partition=None,
metadata=None,
early_stopping_rounds=None, # eg. 200
cat_features=None, # [0,1,2]
grow_policy=None,
min_data_in_leaf=None,
min_child_samples=None,
max_leaves=None,
num_leaves=None,
score_function=None,
leaf_estimation_backtracking=None,
ctr_history_unit=None,
monotone_constraints=None
)

Catboost with validation set

In [0]:
cat_features = list(range(Xtrain.shape[1]))
In [17]:
# time
time_start = time.time()

# current parameters
Xtr = Xtrain
Xtx = Xtest
Xvd = Xvalid

ytr,ytx,yvd = ytrain, ytest,yvalid


# fit the model
model = CatBoostClassifier(verbose=100,
                           random_state=SEED,
                           cat_features=cat_features)
model.fit(Xtr, ytr,
          eval_set=(Xvd, yvd))


# cross-validated predictions on the test set
from sklearn.model_selection import StratifiedKFold, cross_val_predict
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
ypreds = cross_val_predict(model, Xtx, ytx, cv=skf)

# ROC AUC
r = auc(ytx, ypreds)

# time
time_taken = time.time() - time_start
print('Time taken: {:.0f} min {:.0f} secs'.format(*divmod(time_taken,60)))

print('ROC AUC Score ', r)
Learning rate set to 0.135518
0:	learn: 0.6583595	test: 0.6566476	best: 0.6566476 (0)	total: 566ms	remaining: 9m 25s
100:	learn: 0.4959756	test: 0.4940167	best: 0.4940167 (100)	total: 49.2s	remaining: 7m 17s
200:	learn: 0.4902050	test: 0.4912977	best: 0.4912902 (197)	total: 1m 40s	remaining: 6m 40s
300:	learn: 0.4868616	test: 0.4909413	best: 0.4909413 (300)	total: 2m 34s	remaining: 5m 57s
400:	learn: 0.4839041	test: 0.4908873	best: 0.4908697 (399)	total: 3m 30s	remaining: 5m 14s
500:	learn: 0.4810793	test: 0.4908629	best: 0.4908485 (441)	total: 4m 26s	remaining: 4m 25s
600:	learn: 0.4782056	test: 0.4908027	best: 0.4907643 (552)	total: 5m 23s	remaining: 3m 35s
700:	learn: 0.4752454	test: 0.4909058	best: 0.4907643 (552)	total: 6m 23s	remaining: 2m 43s
800:	learn: 0.4724945	test: 0.4910555	best: 0.4907643 (552)	total: 7m 21s	remaining: 1m 49s
900:	learn: 0.4699749	test: 0.4911420	best: 0.4907643 (552)	total: 8m 20s	remaining: 55s
999:	learn: 0.4673505	test: 0.4912207	best: 0.4907643 (552)	total: 9m 20s	remaining: 0us

bestTest = 0.4907643281
bestIteration = 552

Shrink model to first 553 iterations.
Learning rate set to 0.050109
0:	learn: 0.6799256	total: 126ms	remaining: 2m 6s
100:	learn: 0.5202867	total: 11.8s	remaining: 1m 45s
200:	learn: 0.5069335	total: 24.1s	remaining: 1m 35s
300:	learn: 0.4994791	total: 36.7s	remaining: 1m 25s
400:	learn: 0.4945142	total: 49.4s	remaining: 1m 13s
500:	learn: 0.4901611	total: 1m 2s	remaining: 1m 2s
600:	learn: 0.4857835	total: 1m 15s	remaining: 50.3s
700:	learn: 0.4819583	total: 1m 29s	remaining: 38s
800:	learn: 0.4779206	total: 1m 42s	remaining: 25.5s
900:	learn: 0.4738034	total: 1m 56s	remaining: 12.8s
999:	learn: 0.4700741	total: 2m 9s	remaining: 0us
Learning rate set to 0.050109
0:	learn: 0.6802688	total: 126ms	remaining: 2m 5s
100:	learn: 0.5221414	total: 11.8s	remaining: 1m 44s
200:	learn: 0.5083656	total: 23.9s	remaining: 1m 35s
300:	learn: 0.5006129	total: 36.6s	remaining: 1m 25s
400:	learn: 0.4952808	total: 49.5s	remaining: 1m 13s
500:	learn: 0.4905567	total: 1m 2s	remaining: 1m 2s
600:	learn: 0.4861503	total: 1m 15s	remaining: 50.4s
700:	learn: 0.4819168	total: 1m 29s	remaining: 38.1s
800:	learn: 0.4775418	total: 1m 42s	remaining: 25.6s
900:	learn: 0.4733301	total: 1m 56s	remaining: 12.8s
999:	learn: 0.4691861	total: 2m 10s	remaining: 0us
Learning rate set to 0.050109
0:	learn: 0.6801782	total: 123ms	remaining: 2m 2s
100:	learn: 0.5213808	total: 11.8s	remaining: 1m 45s
200:	learn: 0.5077076	total: 24.1s	remaining: 1m 35s
300:	learn: 0.4998469	total: 37s	remaining: 1m 25s
400:	learn: 0.4938801	total: 50.1s	remaining: 1m 14s
500:	learn: 0.4891835	total: 1m 2s	remaining: 1m 2s
600:	learn: 0.4849433	total: 1m 16s	remaining: 50.7s
700:	learn: 0.4806865	total: 1m 30s	remaining: 38.4s
800:	learn: 0.4764044	total: 1m 43s	remaining: 25.8s
900:	learn: 0.4723969	total: 1m 57s	remaining: 12.9s
999:	learn: 0.4684500	total: 2m 10s	remaining: 0us
Learning rate set to 0.050109
0:	learn: 0.6802687	total: 127ms	remaining: 2m 6s
100:	learn: 0.5207595	total: 11.7s	remaining: 1m 44s
200:	learn: 0.5077375	total: 23.8s	remaining: 1m 34s
300:	learn: 0.4999608	total: 36.8s	remaining: 1m 25s
400:	learn: 0.4945871	total: 49.6s	remaining: 1m 14s
500:	learn: 0.4901566	total: 1m 2s	remaining: 1m 2s
600:	learn: 0.4857369	total: 1m 15s	remaining: 50.4s
700:	learn: 0.4816380	total: 1m 29s	remaining: 38.2s
800:	learn: 0.4777497	total: 1m 43s	remaining: 25.6s
900:	learn: 0.4742502	total: 1m 56s	remaining: 12.9s
999:	learn: 0.4704576	total: 2m 10s	remaining: 0us
Learning rate set to 0.050109
0:	learn: 0.6804138	total: 128ms	remaining: 2m 7s
100:	learn: 0.5202666	total: 11.7s	remaining: 1m 44s
200:	learn: 0.5069514	total: 23.8s	remaining: 1m 34s
300:	learn: 0.4996495	total: 36.3s	remaining: 1m 24s
400:	learn: 0.4943098	total: 49.5s	remaining: 1m 13s
500:	learn: 0.4897658	total: 1m 2s	remaining: 1m 2s
600:	learn: 0.4856012	total: 1m 16s	remaining: 50.5s
700:	learn: 0.4817596	total: 1m 29s	remaining: 38.1s
800:	learn: 0.4778182	total: 1m 42s	remaining: 25.6s
900:	learn: 0.4738991	total: 1m 56s	remaining: 12.8s
999:	learn: 0.4703251	total: 2m 10s	remaining: 0us
Time taken: 20 min 23 secs
ROC AUC Score  0.6509684966007818

Catboost with Data Pool

In [0]:
from catboost import Pool

dtrain = Pool(
    data=Xtrain, 
    label=ytrain, 
    cat_features=list(range(Xtrain.shape[1]))
)

dvalid = Pool(
    data=Xvalid, 
    label=yvalid, 
    cat_features=list(range(Xtrain.shape[1]))
)

dtest = Pool(
    data=Xtest, 
    label=ytest, 
    cat_features=list(range(Xtrain.shape[1]))
)
In [19]:
model = CatBoostClassifier(
    iterations=1000,
    custom_loss=['AUC', 'Accuracy']
)

model.fit(
    dtrain,
    eval_set=dvalid,
    verbose=False,
    plot=True # does not render in Google Colab
);

Model Comparison

In [20]:
model1 = CatBoostClassifier(
    learning_rate=0.7,
    iterations=500,
    train_dir='learing_rate_0.7'
)

model2 = CatBoostClassifier(
    learning_rate=0.01,
    iterations=500,
    train_dir='learing_rate_0.01'
)

model1.fit(dtrain, eval_set=dvalid, verbose=20)
model2.fit(dtrain, eval_set=dvalid, verbose=20);
0:	learn: 0.5772303	test: 0.5769549	best: 0.5769549 (0)	total: 497ms	remaining: 4m 7s
20:	learn: 0.4984191	test: 0.4953011	best: 0.4953011 (20)	total: 10.1s	remaining: 3m 51s
40:	learn: 0.4933419	test: 0.4936795	best: 0.4936021 (37)	total: 20.1s	remaining: 3m 44s
60:	learn: 0.4900469	test: 0.4935439	best: 0.4933050 (56)	total: 30.5s	remaining: 3m 39s
80:	learn: 0.4871709	test: 0.4937018	best: 0.4933050 (56)	total: 41.3s	remaining: 3m 33s
100:	learn: 0.4847402	test: 0.4938329	best: 0.4933050 (56)	total: 51.8s	remaining: 3m 24s
120:	learn: 0.4821804	test: 0.4941487	best: 0.4933050 (56)	total: 1m 2s	remaining: 3m 16s
140:	learn: 0.4797221	test: 0.4947369	best: 0.4933050 (56)	total: 1m 14s	remaining: 3m 9s
160:	learn: 0.4775799	test: 0.4950507	best: 0.4933050 (56)	total: 1m 25s	remaining: 3m
180:	learn: 0.4758627	test: 0.4953148	best: 0.4933050 (56)	total: 1m 36s	remaining: 2m 50s
200:	learn: 0.4734559	test: 0.4958357	best: 0.4933050 (56)	total: 1m 48s	remaining: 2m 41s
220:	learn: 0.4712809	test: 0.4964806	best: 0.4933050 (56)	total: 1m 59s	remaining: 2m 30s
240:	learn: 0.4688911	test: 0.4972335	best: 0.4933050 (56)	total: 2m 10s	remaining: 2m 20s
260:	learn: 0.4663191	test: 0.4985798	best: 0.4933050 (56)	total: 2m 22s	remaining: 2m 10s
280:	learn: 0.4641763	test: 0.4990804	best: 0.4933050 (56)	total: 2m 34s	remaining: 2m
300:	learn: 0.4618618	test: 0.4997467	best: 0.4933050 (56)	total: 2m 45s	remaining: 1m 49s
320:	learn: 0.4596678	test: 0.5003752	best: 0.4933050 (56)	total: 2m 57s	remaining: 1m 38s
340:	learn: 0.4575114	test: 0.5011702	best: 0.4933050 (56)	total: 3m 8s	remaining: 1m 28s
360:	learn: 0.4552547	test: 0.5021788	best: 0.4933050 (56)	total: 3m 20s	remaining: 1m 17s
380:	learn: 0.4532256	test: 0.5024523	best: 0.4933050 (56)	total: 3m 32s	remaining: 1m 6s
400:	learn: 0.4509598	test: 0.5038589	best: 0.4933050 (56)	total: 3m 45s	remaining: 55.6s
420:	learn: 0.4486414	test: 0.5053962	best: 0.4933050 (56)	total: 3m 57s	remaining: 44.5s
440:	learn: 0.4465355	test: 0.5061234	best: 0.4933050 (56)	total: 4m 8s	remaining: 33.3s
460:	learn: 0.4440657	test: 0.5071564	best: 0.4933050 (56)	total: 4m 20s	remaining: 22.1s
480:	learn: 0.4416025	test: 0.5082825	best: 0.4933050 (56)	total: 4m 33s	remaining: 10.8s
499:	learn: 0.4395074	test: 0.5088422	best: 0.4933050 (56)	total: 4m 44s	remaining: 0us

bestTest = 0.4933049582
bestIteration = 56

Shrink model to first 57 iterations.
0:	learn: 0.6903807	test: 0.6903720	best: 0.6903720 (0)	total: 502ms	remaining: 4m 10s
20:	learn: 0.6446107	test: 0.6437511	best: 0.6437511 (20)	total: 9.09s	remaining: 3m 27s
40:	learn: 0.6147433	test: 0.6134378	best: 0.6134378 (40)	total: 18s	remaining: 3m 21s
60:	learn: 0.5945701	test: 0.5931442	best: 0.5931442 (60)	total: 26.8s	remaining: 3m 12s
80:	learn: 0.5801082	test: 0.5786443	best: 0.5786443 (80)	total: 35.6s	remaining: 3m 3s
100:	learn: 0.5694754	test: 0.5679793	best: 0.5679793 (100)	total: 44.4s	remaining: 2m 55s
120:	learn: 0.5613517	test: 0.5597943	best: 0.5597943 (120)	total: 53.2s	remaining: 2m 46s
140:	learn: 0.5548642	test: 0.5532626	best: 0.5532626 (140)	total: 1m 2s	remaining: 2m 38s
160:	learn: 0.5494913	test: 0.5478465	best: 0.5478465 (160)	total: 1m 11s	remaining: 2m 30s
180:	learn: 0.5449544	test: 0.5433028	best: 0.5433028 (180)	total: 1m 20s	remaining: 2m 21s
200:	learn: 0.5410408	test: 0.5393482	best: 0.5393482 (200)	total: 1m 29s	remaining: 2m 12s
220:	learn: 0.5376536	test: 0.5359002	best: 0.5359002 (220)	total: 1m 38s	remaining: 2m 4s
240:	learn: 0.5346332	test: 0.5328760	best: 0.5328760 (240)	total: 1m 47s	remaining: 1m 55s
260:	learn: 0.5318606	test: 0.5300718	best: 0.5300718 (260)	total: 1m 57s	remaining: 1m 47s
280:	learn: 0.5294165	test: 0.5276061	best: 0.5276061 (280)	total: 2m 6s	remaining: 1m 38s
300:	learn: 0.5271424	test: 0.5253131	best: 0.5253131 (300)	total: 2m 15s	remaining: 1m 29s
320:	learn: 0.5250642	test: 0.5231780	best: 0.5231780 (320)	total: 2m 24s	remaining: 1m 20s
340:	learn: 0.5231423	test: 0.5212482	best: 0.5212482 (340)	total: 2m 34s	remaining: 1m 11s
360:	learn: 0.5213828	test: 0.5194347	best: 0.5194347 (360)	total: 2m 43s	remaining: 1m 2s
380:	learn: 0.5197640	test: 0.5178181	best: 0.5178181 (380)	total: 2m 52s	remaining: 53.9s
400:	learn: 0.5182437	test: 0.5162623	best: 0.5162623 (400)	total: 3m 2s	remaining: 45s
420:	learn: 0.5168439	test: 0.5148550	best: 0.5148550 (420)	total: 3m 11s	remaining: 35.9s
440:	learn: 0.5155907	test: 0.5135909	best: 0.5135909 (440)	total: 3m 20s	remaining: 26.8s
460:	learn: 0.5143886	test: 0.5123348	best: 0.5123348 (460)	total: 3m 30s	remaining: 17.8s
480:	learn: 0.5132669	test: 0.5111684	best: 0.5111684 (480)	total: 3m 39s	remaining: 8.68s
499:	learn: 0.5122625	test: 0.5101385	best: 0.5101385 (499)	total: 3m 48s	remaining: 0us

bestTest = 0.5101385318
bestIteration = 499

In [21]:
from catboost import MetricVisualizer
MetricVisualizer(['learing_rate_0.7', 'learing_rate_0.01']).start()

Best iteration

In [23]:
model = CatBoostClassifier(
    iterations=100,
    use_best_model=True
)
model.fit(
    dtrain,
    eval_set=dvalid,
    verbose=False,
    plot=True
);
In [24]:
print('Tree count: ' + str(model.tree_count_))
Tree count: 96

Cross-validation

In [25]:
from catboost import cv

params = {
    'loss_function': 'Logloss',
    'iterations': 80,
    'custom_loss': 'AUC',
    'learning_rate': 0.5,
}

cv_data = cv(
    params = params,
    pool = dtrain,
    fold_count=5,
    shuffle=True,
    partition_random_seed=SEED,
    plot=True,
    verbose=False
)
In [26]:
cv_data.head(10)
Out[26]:
iterations test-Logloss-mean test-Logloss-std train-Logloss-mean train-Logloss-std test-AUC-mean test-AUC-std
0 0 0.5932620499512545 0.0011282819109841257 0.5947366868717514 0.0012498737510975735 0.6897238109939845 0.001362580716786672
1 1 0.5621988281350763 0.0017736351241202236 0.5637152373394128 0.0014475233282403726 0.7265567525412949 0.005521867388878331
2 2 0.5432949408723446 0.0005915424407944017 0.5455035779088517 0.0006272895810017126 0.7474332205542394 0.0012269353436785398
3 3 0.5343284062107131 0.0006182158327314069 0.5372028363582526 0.000712416935775204 0.7587523280651916 0.0016100491239105664
4 4 0.5274138170152894 0.0010526223686857026 0.5306497434751712 0.0005543807409622123 0.766170189003498 0.001896865933796239
5 5 0.5219557957938099 0.0012284101434209213 0.5257729065549982 0.0005667871798786818 0.7726890726204309 0.002296434484192623
6 6 0.5171666350842578 0.0011946214726809585 0.5216157494650073 0.00043510127375951876 0.7780015694778525 0.0028681263463788795
7 7 0.5137713825665815 0.0011937082742051375 0.518103155159091 0.0006228136827349699 0.7787176817214723 0.0021667094467511428
8 8 0.5108000835494815 0.0016589991272503385 0.5155783585112524 0.00039882314007828916 0.7804586544172165 0.0023633601833588326
9 9 0.5086363502030085 0.0015485413599112855 0.5135586036346931 0.0005337306250705626 0.7818302895679985 0.002453211584395374
In [27]:
best_value = cv_data['test-Logloss-mean'].min()
best_iter = cv_data['test-Logloss-mean'].values.argmin()

print('Best validation Logloss score, stratified: {:.4f}±{:.4f} on step {}'.format(
    best_value,
    cv_data['test-Logloss-std'][best_iter],
    best_iter)
)
Best validation Logloss score, stratified: 0.4939±0.0017 on step 62

Sklearn Grid Search

In [58]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "learning_rate": [0.001, 0.01, 0.5],
}

clf = CatBoostClassifier(
    iterations=200, 
    cat_features=cat_features, 
    verbose=100
)
grid_search = GridSearchCV(clf, param_grid=param_grid, cv=3)
results = grid_search.fit(Xtrain, ytrain)
results.best_estimator_.get_params()
0:	learn: 0.6928650	total: 323ms	remaining: 1m 4s
100:	learn: 0.6675430	total: 27.1s	remaining: 26.6s
199:	learn: 0.6472045	total: 53.7s	remaining: 0us
0:	learn: 0.6928667	total: 293ms	remaining: 58.3s
100:	learn: 0.6679049	total: 27.3s	remaining: 26.8s
199:	learn: 0.6476945	total: 53.9s	remaining: 0us
0:	learn: 0.6928627	total: 297ms	remaining: 59s
100:	learn: 0.6676752	total: 27.3s	remaining: 26.8s
199:	learn: 0.6472893	total: 53.8s	remaining: 0us
0:	learn: 0.6903414	total: 296ms	remaining: 59s
100:	learn: 0.5704566	total: 27.6s	remaining: 27.1s
199:	learn: 0.5429356	total: 55.2s	remaining: 0us
0:	learn: 0.6903581	total: 296ms	remaining: 58.8s
100:	learn: 0.5709529	total: 27.5s	remaining: 27s
199:	learn: 0.5435002	total: 55.2s	remaining: 0us
0:	learn: 0.6903187	total: 302ms	remaining: 1m
100:	learn: 0.5700994	total: 27.3s	remaining: 26.8s
199:	learn: 0.5426355	total: 55.2s	remaining: 0us
0:	learn: 0.5940780	total: 297ms	remaining: 59s
100:	learn: 0.4865971	total: 32.2s	remaining: 31.5s
199:	learn: 0.4728993	total: 1m 5s	remaining: 0us
0:	learn: 0.5946238	total: 300ms	remaining: 59.7s
100:	learn: 0.4876544	total: 31.5s	remaining: 30.9s
199:	learn: 0.4740342	total: 1m 3s	remaining: 0us
0:	learn: 0.5935279	total: 302ms	remaining: 1m
100:	learn: 0.4860939	total: 31s	remaining: 30.4s
199:	learn: 0.4720844	total: 1m 4s	remaining: 0us
0:	learn: 0.5953363	total: 452ms	remaining: 1m 29s
100:	learn: 0.4872788	total: 46.1s	remaining: 45.2s
199:	learn: 0.4781562	total: 1m 34s	remaining: 0us
Out[58]:
{'cat_features': [0,
  1,
  2,
  3,
  4,
  5,
  6,
  7,
  8,
  9,
  10,
  11,
  12,
  13,
  14,
  15,
  16,
  17,
  18,
  19,
  20,
  21,
  22,
  23],
 'iterations': 200,
 'learning_rate': 0.5,
 'verbose': 100}

Overfitting Detector

In [59]:
model_with_early_stop = CatBoostClassifier(
    iterations=200,
    learning_rate=0.5,
    early_stopping_rounds=20
)

model_with_early_stop.fit(
    dtrain,
    eval_set=dvalid,
    verbose=False,
    plot=True
);

print(model_with_early_stop.tree_count_)
52

Overfitting Detector with eval metric

In [60]:
model_with_early_stop = CatBoostClassifier(
    eval_metric='AUC',
    iterations=200,
    learning_rate=0.5,
    early_stopping_rounds=20
)
model_with_early_stop.fit(
    dtrain,
    eval_set=dvalid,
    verbose=False,
    plot=True
);

print(model_with_early_stop.tree_count_)
80

Model predictions

In [61]:
model = CatBoostClassifier(iterations=200, learning_rate=0.03)
model.fit(dtrain, verbose=50);
0:	learn: 0.6849502	total: 462ms	remaining: 1m 31s
50:	learn: 0.5513495	total: 20.8s	remaining: 1m
100:	learn: 0.5268777	total: 42.1s	remaining: 41.3s
150:	learn: 0.5147412	total: 1m 3s	remaining: 20.6s
199:	learn: 0.5078874	total: 1m 24s	remaining: 0us
In [62]:
print(model.predict(dvalid)) # gives 0 and 1 with threshold 0.5
[0. 0. 1. ... 1. 0. 0.]
In [63]:
print(model.predict_proba(dvalid)) # actual probs
[[0.913 0.087]
 [0.546 0.454]
 [0.483 0.517]
 ...
 [0.412 0.588]
 [0.671 0.329]
 [0.836 0.164]]
In [64]:
raw_pred = model.predict(
    dvalid,
    prediction_type='RawFormulaVal'
)

print(raw_pred)
[-2.348 -0.186  0.068 ...  0.354 -0.715 -1.627]
In [65]:
from numpy import exp

sigmoid = lambda x: 1 / (1 + exp(-x))

probabilities = sigmoid(raw_pred)

print(probabilities)
[0.087 0.454 0.517 ... 0.588 0.329 0.164]

Select decision boundary

In [0]:
import matplotlib.pyplot as plt
from catboost.utils import get_roc_curve
from catboost.utils import get_fpr_curve
from catboost.utils import get_fnr_curve

curve = get_roc_curve(model, dvalid)
(fpr, tpr, thresholds) = curve

(thresholds, fpr) = get_fpr_curve(curve=curve)
(thresholds, fnr) = get_fnr_curve(curve=curve)
In [67]:
def plot_fpr_fnr(thresholds):

    plt.figure(figsize=(16, 8))
    style = {'alpha':0.5, 'lw':2}

    plt.plot(thresholds, fpr, color='blue', label='FPR', **style)
    plt.plot(thresholds, fnr, color='green', label='FNR', **style)

    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xticks(fontsize=16)
    plt.yticks(fontsize=16)
    plt.grid(True)
    plt.xlabel('Threshold', fontsize=16)
    plt.ylabel('Error Rate', fontsize=16)
    plt.title('FPR-FNR curves', fontsize=20)
    plt.legend(loc="lower left", fontsize=16);


plot_fpr_fnr(thresholds)
In [68]:
from catboost.utils import select_threshold

print(select_threshold(model, dvalid, FNR=0.01))
print(select_threshold(model, dvalid, FPR=0.01))
0.10708726102884648
0.651569873964769
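catboost's select_threshold picks the cut-off that achieves a requested FPR or FNR. The same idea can be reproduced from sklearn's roc_curve, noting that FNR = 1 - TPR. The scores below are synthetic, just to show the mechanics:

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.RandomState(100)
y_true = rng.randint(0, 2, size=1000)
# synthetic scores that loosely track the labels
y_score = y_true * 0.3 + rng.rand(1000) * 0.7

fpr, tpr, thresholds = roc_curve(y_true, y_score)
fnr = 1 - tpr  # false negative rate at each threshold

def threshold_for_fpr(max_fpr):
    # fpr is non-decreasing and thresholds are decreasing, so the
    # last threshold within budget is the lowest acceptable cut-off
    ok = fpr <= max_fpr
    return thresholds[ok][-1]

print(threshold_for_fpr(0.01))
```

Swapping `fpr` for `fnr` (and flipping the comparison) gives the FNR-constrained threshold in the same way.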

Metric evaluation on a new dataset

In [69]:
metrics = model.eval_metrics(
    data=dvalid,
    metrics=['Logloss','AUC'],
    ntree_start=0,
    ntree_end=0,
    eval_period=1,
    plot=True
)
In [70]:
print('AUC values:\n{}'.format(np.array(metrics['AUC'])))
AUC values:
[0.687 0.706 0.709 0.711 0.713 0.714 0.718 0.719 0.72  0.72  0.721 0.721
 0.722 0.724 0.725 0.727 0.728 0.728 0.729 0.73  0.731 0.732 0.734 0.735
 0.738 0.739 0.74  0.741 0.742 0.743 0.744 0.746 0.746 0.747 0.748 0.748
 0.749 0.75  0.752 0.753 0.754 0.754 0.754 0.755 0.757 0.757 0.757 0.758
 0.759 0.759 0.76  0.761 0.761 0.762 0.763 0.763 0.764 0.764 0.764 0.765
 0.765 0.765 0.766 0.766 0.766 0.767 0.767 0.768 0.769 0.769 0.769 0.77
 0.77  0.771 0.771 0.771 0.771 0.772 0.772 0.772 0.772 0.773 0.774 0.774
 0.775 0.775 0.775 0.776 0.776 0.776 0.776 0.777 0.777 0.777 0.777 0.778
 0.778 0.778 0.778 0.779 0.779 0.779 0.779 0.779 0.779 0.78  0.78  0.78
 0.78  0.78  0.78  0.78  0.781 0.781 0.781 0.781 0.781 0.782 0.782 0.782
 0.782 0.782 0.783 0.783 0.783 0.783 0.783 0.783 0.783 0.783 0.784 0.784
 0.784 0.784 0.784 0.784 0.784 0.784 0.784 0.785 0.785 0.785 0.785 0.785
 0.785 0.785 0.785 0.785 0.785 0.786 0.786 0.786 0.786 0.786 0.786 0.786
 0.786 0.786 0.786 0.786 0.786 0.787 0.787 0.787 0.787 0.787 0.787 0.787
 0.787 0.787 0.787 0.787 0.787 0.787 0.788 0.788 0.788 0.788 0.788 0.788
 0.788 0.788 0.788 0.788 0.788 0.788 0.788 0.788 0.788 0.788 0.788 0.789
 0.789 0.789 0.789 0.789 0.789 0.789 0.789 0.789]

Feature importances

Prediction values change

The default feature importance for binary classification is PredictionValueChange: how much, on average, the model prediction changes when the feature value changes. These importances are non-negative and normalized to sum to 1, so they can be read as percentages of importance.

In [71]:
np.array(model.get_feature_importance(prettified=True))
Out[71]:
array([['ord_5', 14.019172929038193],
       ['ord_4', 11.013273002912854],
       ['ord_2', 10.760120094510224],
       ['ord_1', 8.82999848618588],
       ['ord_3', 6.949184021791872],
       ['nom_0', 5.75762900497359],
       ['nom_5', 5.212791223118736],
       ['bin_1', 4.635057588031899],
       ['month', 4.529343398805925],
       ['nom_6', 4.2499072113619825],
       ['nom_4', 3.923289123169229],
       ['ord_0', 3.7485003165342454],
       ['nom_7', 3.3917820524488533],
       ['nom_8', 2.87528262167044],
       ['nom_3', 2.497496666504473],
       ['nom_2', 2.1303999848946003],
       ['nom_1', 2.0748000971005953],
       ['day', 1.718567149169596],
       ['bin_4', 1.3593827041199709],
       ['nom_9', 0.32402232365690914],
       ['id', 0.0],
       ['bin_0', 0.0],
       ['bin_2', 0.0],
       ['bin_3', 0.0]], dtype=object)

Loss function change

The non-default LossFunctionChange importance approximates how much the optimized loss function will change if the value of the feature changes. These importances can be negative if a feature hurts the loss. They are not normalized; the absolute value of an importance is on the same scale as the optimized loss value. To compute this importance you need to pass the training pool as an argument.

In [72]:
np.array(model.get_feature_importance(
    dtrain, 
    'LossFunctionChange', 
    prettified=True
))
Out[72]:
array([['ord_5', 0.012284234982247816],
       ['ord_4', 0.00644095104664478],
       ['ord_2', 0.00613197370387888],
       ['nom_6', 0.005695383941560134],
       ['nom_7', 0.0056109341751375275],
       ['nom_8', 0.00560591706498792],
       ['ord_1', 0.005466537491922695],
       ['nom_5', 0.005406589833721748],
       ['ord_3', 0.004213682954066823],
       ['nom_0', 0.003974547456564953],
       ['month', 0.003237605157187758],
       ['nom_4', 0.002916333163498111],
       ['ord_0', 0.002577559338077568],
       ['nom_3', 0.0020057499020414526],
       ['nom_1', 0.001911501086938291],
       ['bin_1', 0.0019033104906682197],
       ['nom_2', 0.0018799555046448934],
       ['nom_9', 0.001689629188606212],
       ['day', 0.0015960392057600462],
       ['bin_4', 0.0005841335998900826],
       ['id', 0.0],
       ['bin_0', 0.0],
       ['bin_2', 0.0],
       ['bin_3', 0.0]], dtype=object)

Shap values

In [73]:
print(model.predict_proba([Xvalid.iloc[1,:]]))
print(model.predict_proba([Xvalid.iloc[91,:]]))
[[0.546 0.454]]
[[0.7 0.3]]
In [74]:
shap_values = model.get_feature_importance(
    dvalid, 
    'ShapValues'
)
expected_value = shap_values[0,-1]
shap_values = shap_values[:,:-1]
print(shap_values.shape)
(48000, 24)
In [75]:
proba = model.predict_proba([Xvalid.iloc[1,:]])[0]
raw = model.predict([Xvalid.iloc[1,:]], prediction_type='RawFormulaVal')[0]
print('Probabilities', proba)
print('Raw formula value %.4f' % raw)
print('Probability from raw value %.4f' % sigmoid(raw))
Probabilities [0.546 0.454]
Raw formula value -0.1860
Probability from raw value 0.4536
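The `sigmoid` helper used above is just the logistic function; a minimal self-contained version, fed the raw formula value from the cell above, reproduces the printed probability of the positive class:

```python
import math

def sigmoid(x):
    # logistic function: maps a raw log-odds score to a probability
    return 1.0 / (1.0 + math.exp(-x))

raw = -0.1860  # RawFormulaVal from the cell above
print('%.4f' % sigmoid(raw))  # 0.4536
```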
In [76]:
import shap

shap.initjs()
shap.force_plot(expected_value, shap_values[1,:], Xvalid.iloc[1,:])
Out[76]:
(SHAP force plot for row 1 of Xvalid; the interactive plot is not rendered in this static export.)
In [77]:
proba = model.predict_proba([Xvalid.iloc[91,:]])[0]
raw = model.predict([Xvalid.iloc[91,:]], prediction_type='RawFormulaVal')[0]
print('Probabilities', proba)
print('Raw formula value %.4f' % raw)
print('Probability from raw value %.4f' % sigmoid(raw))
Probabilities [0.7 0.3]
Raw formula value -0.8479
Probability from raw value 0.2999
In [78]:
import shap
shap.initjs()
shap.force_plot(expected_value, shap_values[91,:], Xvalid.iloc[91,:])
Out[78]:
(SHAP force plot for row 91 of Xvalid; the interactive plot is not rendered in this static export.)
In [79]:
shap.summary_plot(shap_values, Xvalid)

Snapshotting

In [80]:
#!rm 'catboost_info/snapshot.bkp'

model = CatBoostClassifier(
    iterations=100,
    save_snapshot=True,
    snapshot_file='snapshot.bkp',
    snapshot_interval=1
)

model.fit(dtrain, eval_set=dvalid, verbose=10);
Learning rate set to 0.363076

bestTest = 0.4928309227
bestIteration = 95

Shrink model to first 96 iterations.

Saving the model

In [0]:
model = CatBoostClassifier(iterations=10)
model.fit(dtrain, eval_set=dvalid, verbose=False)
model.save_model('catboost_model.bin')
model.save_model('catboost_model.json', format='json')
In [82]:
model.load_model('catboost_model.bin')
print(model.get_params())
print(model.learning_rate_)
{'iterations': 10, 'loss_function': 'Logloss', 'logging_level': 'Silent', 'verbose': 0}
0.5

Hyperparameter tuning

In [83]:
tuned_model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.03,
    depth=6,
    l2_leaf_reg=3,
    random_strength=1,
    bagging_temperature=1
)

tuned_model.fit(
    Xtrain, ytrain,
    cat_features=cat_features,
    verbose=False,
    eval_set=(Xvalid, yvalid),
    plot=True
);

Speeding up the training

In [84]:
fast_model = CatBoostClassifier(
    boosting_type='Plain',
    rsm=0.5,
    one_hot_max_size=50,
    leaf_estimation_iterations=1,
    max_ctr_complexity=1,
    iterations=100,
    learning_rate=0.3,
    bootstrap_type='Bernoulli',
    subsample=0.5
)
fast_model.fit(
    Xtrain, ytrain,
    cat_features=cat_features,
    verbose=False,
    eval_set=(Xvalid, yvalid),
    plot=True
);

Reducing model size

In [85]:
small_model = CatBoostClassifier(
    learning_rate=0.03,
    iterations=500,
    model_size_reg=50,
    max_ctr_complexity=1,
    ctr_leaf_count_limit=100
)
small_model.fit(
    Xtrain, ytrain,
    cat_features=cat_features,
    verbose=False,
    eval_set=(Xvalid, yvalid),
    plot=True
);